Diarisation

Welcome to our model card for Diarisation. This model card describes our currently deployed diarisation model available via our API.

Lelapa-X-Diarisation

Model Details

Basic information about the model: Review section 4.1 of the model cards paper.

Organization: Lelapa AI
Product: Vulavula
Model date: 31 March 2024
Feature: Diarisation
Language: Language Agnostic
Domain: N/A
Model Name: Lelapa-X-Diarisation
Model version: 1.0.0
Model Type: Proprietary Model

Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: Proprietary Model Tested on Audio Data

License: Proprietary

Contact: info@lelapa.ai

Intended use

Use cases envisioned during development: Review section 4.2 of the model cards paper.

Primary intended uses

Intended use is governed by the language and domain of the model. The model is intended for audio diarisation. It is not suitable for settings with many speakers (more than four) and should therefore be used with caution in such cases.

Primary intended users

The Diarisation model can be used for:

  • Audio Diarization in the call center domain
  • Improved ASR performance
  • Market Research and Analysis
  • Compliance monitoring for Customer Interactions

Out-of-scope use cases

The model is language-invariant; however, it should not be used in settings other than telephonic or meeting recordings.

Factors

Factors could include demographic or phenotypic groups, environmental conditions, technical attributes, or others listed in Section 4.3: Review section 4.3 of the model cards paper.

Relevant factors

Groups:

  • Users who recorded utterances used to test the model are diverse across several factors such as age, location, and gender.
  • Evaluation of performance across these groups is underway.

Environmental conditions, Instrumentation and technical attributes:

  • Audio utterances are recorded in environments such as rooms and call centers with a noiseless background.

Metrics

The appropriate metrics to feature in a model card depend on the model being tested. For example, classification systems in which the primary output is a class label differ significantly from systems whose primary output is a score. In all cases, the reported metrics should be determined based on the model’s structure and intended use: Review section 4.4 of the model cards paper.

Model performance measures

The model is evaluated using the Diarization Error Rate (DER), an automatic metric that measures the fraction of time not attributed correctly to a speaker or to non-speech. DER quantifies the error rate in the diarization output by considering three types of errors:

  • Missed Speech: This occurs when the system fails to detect a segment of speech, resulting in a missing speaker label.
  • False Alarm: This happens when the system incorrectly assigns a speaker label to a segment that should not have a speaker label.
  • Speaker Error: This occurs when the system assigns the wrong speaker label to a segment.

The Diarization Error Rate is the sum of these error types, normalized by the total duration of the reference data (ground truth). The smaller the DER, the better the performance.

DER = (Missed Speech + False Alarm + Speaker Error) / Total reference duration
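The formula above can be sketched in a few lines of code. This is an illustrative helper (the function name and example durations are assumptions, not part of the deployed system); each argument is a duration in seconds, and the result is the fraction of the reference duration that was scored in error.

```python
def diarization_error_rate(missed_speech: float,
                           false_alarm: float,
                           speaker_error: float,
                           total_reference: float) -> float:
    """Compute DER: the sum of the three error durations (seconds)
    normalised by the total duration of the reference (ground truth)."""
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (missed_speech + false_alarm + speaker_error) / total_reference

# Hypothetical example: 12 s missed speech, 8 s false alarm and
# 20 s speaker confusion over 600 s of reference audio.
der = diarization_error_rate(12.0, 8.0, 20.0, 600.0)
print(f"DER = {der:.2%}")  # → DER = 6.67%
```

In practice, toolkits that score diarisation also apply details such as a forgiveness collar around reference boundaries, so a production scorer would report slightly different numbers than this bare formula.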

Decision thresholds

No decision thresholds have been specified.

Evaluation data

All referenced datasets would ideally point to any set of documents that provide visibility into the source and composition of the dataset. Evaluation datasets should include datasets that are publicly available for third-party use. These could be existing datasets or new ones provided alongside the model card analyses to enable further benchmarking.

Review section 4.5 of the model cards paper.

Datasets

  • Publicly available datasets in various languages
  • Proprietary call center dataset

Motivation

These datasets have been selected because they are open-source, high-quality, and cover a range of languages. They help to challenge the model under different settings and languages, and to assess whether it can learn voice representations independently of the spoken language.

Training data

Review section 4.6 of the model cards paper.

N/A

Quantitative analyses

Quantitative analyses should be disaggregated, that is, broken down by the chosen factors. Quantitative analyses should provide the results of evaluating the model according to the chosen metrics, providing confidence interval values when possible.

Review section 4.7 of the model cards paper.

Unitary results

| Dataset | AMI | VoxConverse | Proprietary Zulu | AliMeeting |
| --- | --- | --- | --- | --- |
| Number of Speakers | [3, 4] | [2, 21] | 2 | [3, 10] |
| Avg Speakers | 3.9375 | 6.47 | 2 | 4.7 |
| Total Duration | 8h 53mn | 43h 15mn | 49h 47mn | 10h 44mn |
| Overlaps Duration | 44mn | 52mn | 2h 51mn | 43mn |
| Number of Overlaps | 2851 | 4416 | 9863 | 4126 |
| Proportion of Overlaps | 8.34% | 2.02% | 5.73% | 6.63% |
| DER | 8.69% | 14.06% | 2.91% | 11.85% |

Intersectional results

Our experiments show that:

  • High DER on other datasets could be due to challenges with speech overlap and the number of speakers.

  • Noise reduction impacts audio quality and DER, highlighting a trade-off between noise removal and accuracy.

  • The model shows potential for applications with a moderate number of speakers but may struggle with complex scenarios with many speakers.

Ethical considerations

This section is intended to demonstrate the ethical considerations that went into model development, surfacing ethical challenges and solutions to stakeholders. The ethical analysis does not always lead to precise solutions, but the process of ethical contemplation is worthwhile to inform on responsible practices and next steps in future work: Review section 4.8 of the model cards paper.

More details are in the datasheet.

Caveats and recommendations

This section should list additional concerns that were not covered in the previous sections.

Review section 4.9 of the model cards paper.

This model is limited in the number of speakers it can handle (between three and five at most). The results on the benchmark datasets show that the model may not generalise well to datasets with a large number of speakers.